Safe and Efficient Off-Policy Reinforcement Learning
Remi Munos, Tom Stepleton, Anna Harutyunyan, Marc Bellemare
In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q(λ), which was an open problem since 1989. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games.
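To make the construction concrete, here is a minimal sketch of the Retrace(λ) return for a single trajectory. The tabular, vector-valued interface and the function names are our own, but the trace coefficient c_s = λ min(1, π(a_s|x_s)/μ(a_s|x_s)) is the one the paper defines: truncating the importance ratio at 1 is what gives the low variance, while keeping the ratio at all is what makes the update safe under any behaviour policy.

```python
import numpy as np

def retrace_target(q, pi, mu, states, actions, rewards, gamma=0.99, lam=1.0):
    """Sketch of the Retrace(lambda) return for one trajectory.

    q(s)  -> np.ndarray of Q-value estimates over actions
    pi(s) -> np.ndarray of target-policy probabilities over actions
    mu(s) -> np.ndarray of behaviour-policy probabilities over actions
    states = [x_0 .. x_T], actions = [a_0 .. a_{T-1}], rewards = [r_0 .. r_{T-1}]
    Returns the corrected target for Q(x_0, a_0).
    """
    target = q(states[0])[actions[0]]
    trace = 1.0  # running product of the truncated ratios c_1 ... c_t
    for t in range(len(rewards)):
        if t > 0:
            # Retrace coefficient: c_t = lambda * min(1, pi(a_t|x_t) / mu(a_t|x_t)).
            x_t, a_t = states[t], actions[t]
            trace *= lam * min(1.0, pi(x_t)[a_t] / mu(x_t)[a_t])
        # TD error: delta_t = r_t + gamma * E_{a~pi} Q(x_{t+1}, a) - Q(x_t, a_t).
        exp_q_next = np.dot(pi(states[t + 1]), q(states[t + 1]))
        delta = rewards[t] + gamma * exp_q_next - q(states[t])[actions[t]]
        target += (gamma ** t) * trace * delta
    return target
```

Note that when the behaviour policy matches the target policy and λ = 1, every c_t equals 1 and the target reduces to the on-policy λ-return, which is exactly the "efficient near on-policy" property the abstract claims.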
A Proofs
In this proof, we use the notion of weighted exchangeability as defined in Section 3.2 of [27].
A.2 Proof of Proposition 4.2
The following proof is an adaptation of [14, Proposition 1] to our setting. To get from (32) to (33), we use Assumption 2 and Markov's inequality.
B.1 Further comments on the differences between [14] and COPP
In this subsection, we elaborate on the differences between our work and [14]. As mentioned in the main text, because we integrate out the action in Eq. 7, we are essentially able to use the full dataset when constructing the CP intervals.
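Equations (32) and (33) are not reproduced in this fragment, but the inequality invoked in that step has its standard form; for any nonnegative random variable X:

```latex
% Markov's inequality, as invoked in the step from (32) to (33):
\[
  \Pr\bigl(X \ge a\bigr) \;\le\; \frac{\mathbb{E}[X]}{a}
  \qquad \text{for } X \ge 0,\ a > 0.
\]
```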
AVG-DICE: Stationary Distribution Correction by Regression
Che, Fengdi, Chan, Bryan, Ma, Chen, Mahmood, A. Rupam
Off-policy policy evaluation (OPE), an essential component of reinforcement learning, has long suffered from stationary state-distribution mismatch, which undermines both the stability and the accuracy of OPE estimates. While existing methods correct distribution shifts by estimating density ratios, they often rely on expensive optimization or backward-Bellman-based updates and struggle to outperform simpler baselines. We introduce AVG-DICE, a computationally simple Monte Carlo estimator for the density ratio that averages discounted importance sampling ratios, providing an unbiased and consistent correction. AVG-DICE extends naturally to nonlinear function approximation via regression, which we lightly tune and test on OPE tasks in MuJoCo Gym environments, comparing against state-of-the-art density-ratio estimators run with their reported hyperparameters. In our experiments, AVG-DICE is at least as accurate as state-of-the-art estimators and sometimes offers orders-of-magnitude improvements. However, a sensitivity analysis shows that the best-performing hyperparameters can vary substantially across discount factors, so re-tuning is recommended.
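The abstract does not give the estimator's exact form, so the following is only a hedged sketch of one natural reading of "averaging discounted importance sampling ratios" as a stationary-distribution correction; every name and interface here is our own assumption.

```python
import numpy as np

def avg_dice_weights(trajectory, pi, mu, gamma=0.99):
    """Hedged sketch: credit each visited state with the discounted
    cumulative importance-sampling ratio of the actions taken so far.

    trajectory: list of (state, action) pairs generated by behaviour policy mu.
    pi(s), mu(s) -> np.ndarray of action probabilities.
    Returns one correction weight per visited state.
    """
    weights, ratio = [], 1.0
    for t, (s, a) in enumerate(trajectory):
        # The state at time t was reached via actions a_0 .. a_{t-1},
        # so its weight uses the ratio accumulated *before* this step.
        # (1 - gamma) normalises the discounted visitation distribution.
        weights.append((1.0 - gamma) * (gamma ** t) * ratio)
        ratio *= pi(s)[a] / mu(s)[a]
    return np.asarray(weights)
```

Averaging such weights over many trajectories would give an optimization-free Monte Carlo estimate of the discounted stationary-distribution correction per state, consistent with the abstract's claim of an unbiased and consistent corrector.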
Learning on One Mode: Addressing Multi-Modality in Offline Reinforcement Learning
Wang, Mianchu, Jin, Yue, Montana, Giovanni
Offline reinforcement learning (RL) enables policy learning from static datasets, without active environment interaction, making it ideal for high-stakes applications like autonomous driving and robot manipulation [Levine et al., 2020, Ma et al., 2022, Wang et al., 2024a]. A key challenge in offline RL is managing the discrepancy between the learned policy and the behaviour policy that generated the dataset. Small discrepancies can hinder policy improvement, while large discrepancies push the learned policy into uncharted areas, causing significant extrapolation errors and poor generalisation [Fujimoto et al., 2019, Yang et al., 2023]. Addressing these challenges, existing research has proposed various solutions. Conservative approaches penalise actions that stray into out-of-distribution (OOD) regions [Yu et al., 2020, Kumar et al., 2020], while others regularise the policy by minimising its divergence from the behaviour policy, ensuring better fidelity to the dataset [Fujimoto and Gu, 2021, Wu et al., 2019].
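Of the two families this paragraph names, the divergence-regularisation style has a particularly compact form. Below is a hedged sketch in the spirit of TD3+BC [Fujimoto and Gu, 2021]; the module names and batch interface are assumptions, not any paper's reference code.

```python
import torch

def bc_regularised_policy_loss(critic, policy, states, actions, alpha=2.5):
    """Sketch of a behaviour-regularised actor loss: maximise Q while
    staying close to the dataset actions. `critic` and `policy` are
    torch modules; `states`, `actions` are batches from the offline data.
    """
    pi_actions = policy(states)
    q = critic(states, pi_actions)
    # Normalising by the mean |Q| keeps the two terms on a comparable scale.
    lam = alpha / q.abs().mean().detach()
    # -Q pushes toward high-value actions; the MSE term penalises
    # divergence from the behaviour policy that produced `actions`.
    return -(lam * q).mean() + torch.nn.functional.mse_loss(pi_actions, actions)
```

The single scalar `alpha` then trades off policy improvement against fidelity to the dataset, which is precisely the discrepancy-management dilemma the paragraph describes.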
Streetwise Agents: Empowering Offline RL Policies to Outsmart Exogenous Stochastic Disturbances in RTC
Soni, Aditya, Das, Mayukh, Parayil, Anjaly, Ghosh, Supriyo, Shandilya, Shivam, Cheng, Ching-An, Gopal, Vishak, Khairy, Sami, Mittag, Gabriel, Hosseinkashi, Yasaman, Bansal, Chetan
The difficulty of exploring and training online on real production systems limits the scope of real-time, data- and feedback-driven decision making. The most feasible approach is to adopt offline reinforcement learning from limited trajectory samples. However, after deployment, such policies fail due to exogenous factors that temporarily or permanently disturb the transition distribution of the decision process induced by the offline samples. This results in critical policy failures and generalization errors in sensitive domains like Real-Time Communication (RTC). We address the problem of identifying robust actions in the presence of domain shifts caused by unseen exogenous stochastic factors in the wild. Since it is impossible to learn, within the support of the offline data, generalized policies that are robust to these unseen exogenous disturbances, we propose Streetwise, a novel post-deployment shaping of policies conditioned on a real-time characterization of out-of-distribution sub-spaces. This yields robust actions in bandwidth estimation (BWE) of network bottlenecks in RTC as well as in standard benchmarks. Our extensive experimental results on BWE and other standard offline RL benchmark environments demonstrate significant improvements ($\approx$ 18% in some scenarios) in final returns, with respect to end-user metrics, over state-of-the-art baselines.
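The abstract names the mechanism (post-deployment shaping conditioned on a real-time OOD characterization) but not the shaping rule itself, so the following is purely illustrative; the blending rule and every name here are our assumptions, not the paper's method.

```python
import numpy as np

def shaped_action(policy_action, ood_score, fallback_action, threshold=0.5):
    """Illustrative only: blend the offline policy's action toward a
    conservative fallback in proportion to an out-of-distribution score.
    The actual Streetwise shaping rule is not given in the abstract.
    """
    # ood_score in [0, 1]: 0 = well inside the offline data support,
    # 1 = far outside it. Below the threshold, trust the policy as-is.
    w = np.clip((ood_score - threshold) / (1.0 - threshold), 0.0, 1.0)
    return (1.0 - w) * policy_action + w * fallback_action
```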
Drama: Mamba-Enabled Model-Based Reinforcement Learning Is Sample and Parameter Efficient
Wang, Wenlong, Dusparic, Ivana, Shi, Yucheng, Zhang, Ke, Cahill, Vinny
Model-based reinforcement learning (RL) offers a solution to the data inefficiency that plagues most model-free RL algorithms. However, learning a robust world model often demands complex, deep architectures that are expensive to compute and train. Within the world model, the dynamics model is particularly crucial for accurate predictions, and various dynamics-model architectures have been explored, each with its own challenges. Recurrent neural network (RNN) based world models face issues such as vanishing gradients and difficulty capturing long-term dependencies, while transformers suffer from the well-known limitation of self-attention, whose memory and computational complexity both scale as $O(n^2)$ in the sequence length $n$. To address these challenges, we propose a state space model (SSM) based world model, specifically built on Mamba, that achieves $O(n)$ memory and computational complexity while effectively capturing long-term dependencies and enabling efficient training on longer sequences. We also introduce a novel sampling method to mitigate the suboptimality caused by an inaccurate world model in the early stages of training; combined with the above, it achieves a normalised score comparable to other state-of-the-art model-based RL algorithms using a world model with only 7 million trainable parameters, which is accessible and can be trained on an off-the-shelf laptop. Our code is available at https://github.com/realwenlongwang/drama.git.
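The $O(n)$ claim follows from the linear recurrence at the core of any SSM: the sequence is consumed in a single pass with a fixed-size hidden state. A minimal discretised sketch of that recurrence follows; Mamba's selective, input-dependent parameterisation of A, B, C is deliberately omitted.

```python
import numpy as np

def ssm_scan(A, B, C, inputs):
    """Minimal linear state-space recurrence:
        h_t = A h_{t-1} + B x_t,   y_t = C h_t.
    One step per token, fixed-size state, hence O(n) time and memory
    in the sequence length n.

    A: (d, d) state matrix; B, C: (d,) input/output maps;
    inputs: iterable of scalar tokens x_1 .. x_n.
    """
    h = np.zeros(A.shape[0])
    outputs = []
    for x in inputs:
        h = A @ h + B * x        # state update
        outputs.append(C @ h)    # readout
    return np.asarray(outputs)
```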
A cGAN Ensemble-based Uncertainty-aware Surrogate Model for Offline Model-based Optimization in Industrial Control Problems
This study focuses on two important problems related to applying offline model-based optimization to real-world industrial control problems. The first problem is how to create a reliable probabilistic model that accurately captures the dynamics present in noisy industrial data. The second problem is how to reliably optimize control parameters without actively collecting feedback from industrial systems. Specifically, we introduce a novel cGAN ensemble-based uncertainty-aware surrogate model for reliable offline model-based optimization in industrial control problems. The effectiveness of the proposed method is demonstrated through extensive experiments conducted on two representative cases, namely a discrete control case and a continuous control case. The results of these experiments show that our method outperforms several competitive baselines in the field of offline model-based optimization for industrial control.
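The abstract does not detail the surrogate's interface, so this is a hedged sketch of the general pattern it describes: query an ensemble of conditional generators at the same control setting and read cross-ensemble disagreement as uncertainty. All names here are assumptions.

```python
import numpy as np

def ensemble_prediction(generators, control_params, noise_dim=8, n_samples=64):
    """Sketch: sample each cGAN generator many times at one control
    setting; the pooled mean is the surrogate prediction and the spread
    of per-generator means is the uncertainty signal.

    generators: list of callables g(z, c) -> predicted outcome.
    """
    rng = np.random.default_rng(0)
    samples = np.array([
        [g(rng.standard_normal(noise_dim), control_params)
         for _ in range(n_samples)]
        for g in generators
    ])                                              # (ensemble, n_samples, ...)
    mean = samples.mean(axis=(0, 1))                # surrogate prediction
    uncertainty = samples.mean(axis=1).std(axis=0)  # ensemble disagreement
    return mean, uncertainty
```

A downstream optimizer could then score a candidate control setting by, e.g., mean minus a multiple of the uncertainty, so that parameters are only pushed into regions where the ensemble agrees; this is one way to realise the "reliable optimization without active feedback" goal the abstract states.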